Background & Context
Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave the service and the reasons why, so that it can improve in those areas.
As a Data Scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
You need to identify the best possible model that will give the required performance
Objective
Data Dictionary:
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
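One relationship worth noting in the dictionary: "Open to Buy" is conventionally the credit limit minus the balance currently carried, so `Avg_Open_To_Buy` should roughly equal `Credit_Limit - Total_Revolving_Bal`. A minimal sketch on made-up values (the column names match the dictionary; the numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Toy rows using the data-dictionary column names (values are illustrative)
toy = pd.DataFrame({
    'Credit_Limit': [12000.0, 3500.0, 8000.0],
    'Total_Revolving_Bal': [1500.0, 0.0, 2200.0],
})

# Open to Buy = credit limit minus the revolving balance carried over
toy['Open_To_Buy'] = toy['Credit_Limit'] - toy['Total_Revolving_Bal']
print(toy)
```

This near-deterministic relationship is why `Avg_Open_To_Buy` and `Credit_Limit` turn out to be highly correlated later in the analysis.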
Import scikit-learn and the other Python libraries needed to achieve our objectives
!pip install shap
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
# Library to import Logistic Regression
from sklearn.linear_model import LogisticRegression
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
GradientBoostingClassifier
)
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
# Libraries to tune models and get different metric scores
from sklearn import metrics
from sklearn.metrics import (
confusion_matrix,
classification_report,
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score
)
# Libraries for hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# Libraries for k fold and cross validation score
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Libraries to create pipelines
from sklearn.pipeline import Pipeline
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Remove limits for the number of displayed columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
# Library for SHAP
import shap
# Import libraries to create custom transformer
from sklearn.base import BaseEstimator, TransformerMixin
# Importing library to mount my google drive to access data
from google.colab import drive
drive.mount('/content/drive')
# Defining file path and saving data as a pandas dataframe
path = '/content/drive/MyDrive/Colab Notebooks/PGP in AI and ML/Project_5/BankChurners.csv'
data = pd.read_csv(path)
df = data.copy()
# Size of dataset
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns')
# reading the first 5 rows of our dataset
df.head()
# Display last five rows
df.tail()
Based on this quick view, we can see that the dataset has mixed datatypes (ints, floats, and categoricals)
# Display feature info
df.info()
# Look at data summary for numerical values
df.describe().T
# Review summary of the non numerical features
df.describe(exclude='number').T
This section will look at the unique values of the different features
# Lets see how many unique values the client number has
df['CLIENTNUM'].value_counts()
Each entry has a unique value, so the dataset covers distinct clients. The client number is not useful for modeling, so we will drop the column from the dataset
# Dropping client number column
df.drop(['CLIENTNUM'], axis=1, inplace=True)
# Display first five rows to verify column was dropped
df.head()
# Converting object datatypes to categorical to help with modeling, but also to make it easier for me when exploring the dataset
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    df[col] = df[col].astype('category')
# Look at dtype info to verify conversion was correct
df.info()
# Creating list of numerical features
num_cols = df.select_dtypes(include=['number']).columns.tolist()
# Printing the unique values and counts for each categorical feature
for col in cat_cols:
    print('Unique values and counts for {}:'.format(col))
    print(df[col].value_counts())
    print('_ _' * 50)
# Examining the entries for Income Category that has 'abc' as an entry
df[df['Income_Category']=='abc']
Looking at the entries that had 'abc' as income, it's safe to assume this was either an error or a way of omitting income data. These entries will be replaced with NaN
# Replacing 'abc' with np.nan
df['Income_Category'] = df['Income_Category'].replace('abc', np.nan)
# Examine unique values for numerical features
#Compute number of unique entries for each feature
unique_values = df.select_dtypes(include='number').nunique().sort_values()
# Plot unique values for numerical features
plt.figure(figsize=(15,4))
sns.barplot(x=unique_values.index, y=unique_values.values)
plt.title('Unique Values per Numerical Feature')
plt.xticks(rotation=45)
plt.show()
From the chart, we can see the top three features with the most unique entries are:
This section will check whether our dataset has any duplicates
# Checking dataframe for duplicates
df.duplicated().sum()
We have no duplicates in our dataset
This section will examine the features that have missing values. It will not treat the missing values however.
# Checking for missing values in our dataset
(round(df.isna().sum() / df.isna().count(),2) * 100).sort_values(ascending=False)
Education level, income category and marital status have missing entries
This section will cover data preprocessing, exploratory data analysis (univariate and bivariate)
# Barplot for Attrition Flag
labeled_barplot(df, 'Attrition_Flag',perc=True)
# Barplot for Gender
labeled_barplot(df, 'Gender', perc=True)
# Barplot for Education Level
labeled_barplot(df, 'Education_Level', perc=True)
In descending order, the highest levels of education are:
# Barplot for marital status
labeled_barplot(df, 'Marital_Status', perc=True)
46% of the customer base is married, less than 40% is single, and less than 10% is divorced
# Barplot for Income Category
labeled_barplot(df, 'Income_Category', perc=True)
The majority of customers make less than 40k dollars a year, by a large margin
# Barplot for Card Category
labeled_barplot(df, 'Card_Category', perc=True)
Overwhelming majority are blue card members
# Histogram and boxplot for Customer Age
histogram_boxplot(df, 'Customer_Age', kde=True)
# Histogram and boxplot for number of dependents
histogram_boxplot(df, 'Dependent_count', kde = True)
# Histogram and boxplot for customer relationship with the bank
histogram_boxplot(df, 'Months_on_book')
# Univariate analysis for total number of products held by the customer
histogram_boxplot(df, 'Total_Relationship_Count')
# Histogram and boxplot for Months inactive
histogram_boxplot(df, 'Months_Inactive_12_mon')
# Histogram and boxplot for No. of Contacts between the customer and bank in the last 12 months
histogram_boxplot(df, 'Contacts_Count_12_mon')
# Histogram and boxplot for Credit Limit
histogram_boxplot(df, 'Credit_Limit')
# Histogram and boxplot for balance that carries over to the next month
histogram_boxplot(df, 'Total_Revolving_Bal')
# histogram and boxplot amount left on the credit card to use
histogram_boxplot(df, 'Avg_Open_To_Buy')
# Histogram and boxplot for total transaction amount in 4th quarter and the total transaction amount in 1st quarter
histogram_boxplot(df, 'Total_Amt_Chng_Q4_Q1')
# Histogram and boxplot for total transactions in the last 12 months
histogram_boxplot(df, 'Total_Trans_Amt')
# Histogram and boxplot for Total number of transactions
histogram_boxplot(df, 'Total_Trans_Ct')
# Histogram and boxplot for the ratio of total number of transactions in Q4 and Q1
histogram_boxplot(df,'Total_Ct_Chng_Q4_Q1')
# Histogram and boxplot for average utilization ratio
histogram_boxplot(df, 'Avg_Utilization_Ratio')
# Calculating feature correlation to be used for the correlation matrix
df_corr = df.corr()
# Create labels to make the Strong, Medium and Weak correlations more clear
# I will define strong correlation as > 0.75
# I will define medium correlation as > 0.50
# I will define weak correlation as > 0.25
labels = np.where(
np.abs(df_corr) > 0.75,
'S',
np.where(np.abs(df_corr) > 0.5, 'M',
np.where(np.abs(df_corr) > 0.25, 'W', '')),
)
# Plot correlation matrix with the diagonals masked
plt.figure(figsize=(15,15))
sns.heatmap(
df_corr,
mask=np.eye(len(df_corr)),
square=True,
center=0,
annot=labels,
fmt='',
linewidths=0.5,
cmap='viridis',
cbar_kws={'shrink':0.8},
vmin=-1,
vmax=1
);
Strong correlation between:
Medium correlation between:
Weak correlation between:
# Let's take a deeper dive into the feature correlations with pair plots
sns.pairplot(data=df, hue='Attrition_Flag', diag_kind='kde');
Key Observations from pairplot:
# Importing plotly to make the following data analysis easier
import plotly.express as px
# Taking another look at the months on book and customer age scatter plot to investigate the horizontal line we observed
px.scatter(data_frame=df, x='Customer_Age', y='Months_on_book', )
# Display rows where month on book equals 36 and age is less than 40
df[(df['Months_on_book'] == 36) & (df['Customer_Age'] < 40)]
# Display rows where month on book equals 36 and age is greater than 56
df[(df['Months_on_book'] == 36) & (df['Customer_Age'] > 56)]
I'm going to choose the latter for now
# Dropping avg_open_to_buy column
df.drop('Avg_Open_To_Buy', axis=1, inplace=True)
num_cols
# Updating our numerical columns list
num_cols.remove('Avg_Open_To_Buy')
# Stacked histogram for Gender vs Attrition Flag
sns.histplot(df, x='Gender', hue='Attrition_Flag', multiple='stack');
# Stacked histogram for Education Level vs Attrition Flag
sns.histplot(df, x='Education_Level', hue='Attrition_Flag', multiple='stack');
# Stacked histogram for Marital Status vs Attrition Flag
sns.histplot(df, x='Marital_Status', hue='Attrition_Flag', multiple='stack');
We see almost an equal number of married and single customers attrited, even though there are more married customers overall
# Stacked histogram for Income Category vs Attrition Flag
sns.histplot(df, x='Income_Category', hue='Attrition_Flag', multiple='stack');
We see the same pattern for attrited customers as in the overall distribution: attrition is highest among customers who make less than 40k dollars
# Stacked histogram for Card Category vs Attrition Flag
sns.histplot(df, x='Card_Category', hue='Attrition_Flag', multiple='stack');
The largest attrition is among Blue card members, but there is a large imbalance favoring Blue card members
This section is to examine the relationship between our target value (Attrition Flag) and the numerical features
# Number of dependents vs Attrition
distribution_plot_wrt_target(df, 'Dependent_count', 'Attrition_Flag')
# Months with bank vs Attrition
distribution_plot_wrt_target(df, 'Months_on_book', 'Attrition_Flag')
# Total number of products with bank vs Attrition
distribution_plot_wrt_target(df, 'Total_Relationship_Count', 'Attrition_Flag')
# Months Inactive vs Attrition
distribution_plot_wrt_target(df, 'Months_Inactive_12_mon', 'Attrition_Flag')
# Contacts_Count_12_mon vs Attrition
distribution_plot_wrt_target(df, 'Contacts_Count_12_mon', 'Attrition_Flag')
# Credit Limit vs Attrition
distribution_plot_wrt_target(df, 'Credit_Limit', 'Attrition_Flag')
# Total Revolving balance vs Attrition
distribution_plot_wrt_target(df, 'Total_Revolving_Bal', 'Attrition_Flag')
Customers with higher revolving balance (>~1400) tend to stay on as customers
# Total_Amt_Chng_Q4_Q1 vs Attrition
distribution_plot_wrt_target(df, 'Total_Amt_Chng_Q4_Q1','Attrition_Flag')
Customers with ratios less than ~0.63 tend to attrite
# Total_Trans_Amt vs Attrition
distribution_plot_wrt_target(df, 'Total_Trans_Amt', 'Attrition_Flag')
# Total_Trans_Ct vs Attrition
distribution_plot_wrt_target(df, 'Total_Trans_Ct', 'Attrition_Flag')
Customers with a low number of transactions attrite, likely reflecting their lack of card use
# Total_Ct_Chng_Q4_Q1 vs Attrition
distribution_plot_wrt_target(df, 'Total_Ct_Chng_Q4_Q1', 'Attrition_Flag')
Customers with lower ratios tend to attrite
# Avg_Utilization_Ratio vs Attrition
distribution_plot_wrt_target(df, 'Avg_Utilization_Ratio', 'Attrition_Flag')
This section will go through the feature set and look at what the demographics for the different card members look like. I'll save the demographics summary until the end
# Card Cat vs Gender
sns.catplot(x='Gender', hue='Attrition_Flag', col='Card_Category', data=df, kind='count')
# Printing value counts to better analyze the last three boxes
pd.DataFrame(round(df.groupby(['Card_Category'])['Gender'].value_counts(1),2))
# Card Cat vs Education Level
sns.catplot(x='Education_Level', hue='Attrition_Flag', col='Card_Category', data=df, kind='count')
# Printing value counts to better analyze the last three boxes
pd.DataFrame(round(df.groupby(['Card_Category'])['Education_Level'].value_counts(1),2))
# Card Cat vs Marital Status
sns.catplot(x='Marital_Status', hue='Attrition_Flag', col='Card_Category', data=df, kind='count')
# Printing value counts to better analyze the last three boxes
pd.DataFrame(df.groupby(['Card_Category'])['Marital_Status'].value_counts(1))
# Card Cat vs Income Category
sns.catplot(x='Income_Category', hue='Attrition_Flag', col='Card_Category', data=df, kind='count')
# Printing value counts to better analyze the last three boxes
pd.DataFrame(df.groupby(['Card_Category'])['Income_Category'].value_counts(1))
# Let's take a deeper dive into how the card categories breakout based on the numerical features
# Plotting boxplots of our numerical data
plt.figure(1,figsize=(15,12))
for i, name in enumerate(num_cols):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(data=df, x=name, y='Card_Category')
    plt.tight_layout()
    plt.title(name)
plt.show()
Note: Functions for plotting histogram and box plots, labeled barplots, etc are located in the Appendix
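Those helper functions aren't reproduced in this chunk. As a rough sketch of what `labeled_barplot` might look like (the signature is taken from how it's called above; the body is an assumption, not the Appendix code):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def labeled_barplot(data, feature, perc=False):
    """Barplot of a categorical feature, annotated with counts or percentages.

    Sketch of the Appendix helper; only the signature comes from the calls above.
    """
    counts = data[feature].value_counts()
    total = len(data[feature])
    ax = sns.countplot(data=data, x=feature, order=counts.index)
    for patch in ax.patches:
        count = patch.get_height()
        # Annotate each bar with either a percentage of the total or a raw count
        label = f'{100 * count / total:.1f}%' if perc else f'{int(count)}'
        ax.annotate(label, (patch.get_x() + patch.get_width() / 2, count),
                    ha='center', va='bottom')
    plt.show()
```

`histogram_boxplot` and `distribution_plot_wrt_target` follow the same pattern: a thin wrapper around seaborn calls with the column name and optional flags passed through.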
Below I've listed the key observations from our univariate and bivariate EDA, and I also looked at the card category demographics. I'll note that there was an imbalance in card category, so the Gold, Platinum, and Silver member demographics need more sample points to further refine their characteristics
Key Observations:
Generalizations for different card members: Blue Card Members:
Gold Card Member:
Platinum Card Member:
Silver Card Member:
This section will focus on:
# Encoding our target variable (Attrition_Flag) with 1 for attrited customer and 0 for existing customer
encode = {'Attrited Customer': 1, 'Existing Customer': 0}
df['Attrition_Flag'].replace(encode, inplace=True)
# Creating a copy of the data to build the model
df1 = df.copy()
# Separating target and dependent features
# Dependent features
X = df1.drop('Attrition_Flag',axis=1)
X = pd.get_dummies(X)
# Target feature
y = df1['Attrition_Flag']
# Split our data into train, val and test sets
# First splitting our data set into a temp and test set
X_temp, X_test, y_temp, y_test = train_test_split(X,y, test_size=0.2, random_state=1, stratify=y)
# Now we're splitting our temporary set into train and val
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
# Take a look at the number of observations for each split
print('Number of rows for train data =', X_train.shape[0])
print('Number of rows for validation data=',X_val.shape[0])
print('Number of rows for test data=',X_test.shape[0])
There are a few options I considered for outlier treatment:
I've decided to log transform my data. One reason is that I want to see if this transformation helps alleviate the outliers we noticed in the customer age and months on book scatter plot
# Defining my columns that need log transformation
cols_to_log = ['Customer_Age', 'Months_on_book', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
# Transform Training data
for colname in cols_to_log:
    X_train[colname + '_log'] = np.log(X_train[colname] + 1)
X_train.drop(cols_to_log, axis=1, inplace=True)
# Transform validation data
for colname in cols_to_log:
    X_val[colname + '_log'] = np.log(X_val[colname] + 1)
X_val.drop(cols_to_log, axis=1, inplace=True)
# Transform test data
for colname in cols_to_log:
    X_test[colname + '_log'] = np.log(X_test[colname] + 1)
X_test.drop(cols_to_log, axis=1, inplace=True)
This section will treat the columns we observed earlier with missing values: Education level, income category and marital status
# For this dataset, we have three categorical features with missing values so I will employ a simple imputer to replace with the most frequent
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
Note: functions to calculate model performance and confusion matrix is located in the Appendix
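Those two helpers also live in the Appendix. A plausible sketch consistent with how they're used here is a function that returns a one-row DataFrame of scores (the exact columns in the original may differ; this layout is an assumption):

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision, and F1 for a fitted classifier.

    Sketch of the Appendix helper; the column layout is an assumption.
    """
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            'Accuracy': accuracy_score(target, pred),
            'Recall': recall_score(target, pred),
            'Precision': precision_score(target, pred),
            'F1': f1_score(target, pred),
        },
        index=[0],
    )
```

`confusion_matrix_sklearn` would similarly wrap `sklearn.metrics.confusion_matrix` on the model's predictions and render it as a heatmap.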
Model can make incorrect predictions by: predicting a customer will attrite when they actually stay (a false positive), or predicting a customer will stay when they actually attrite (a false negative).
Business case approach: a false negative is the costlier error here, since the bank loses a customer it never tried to retain, while a false positive only costs an unnecessary retention effort.
Metric of Interest: Recall, since maximizing recall minimizes false negatives.
I've selected the following 6 models to build and evaluate based on preference from previous projects: Logistic Regression, Decision Tree, Bagging Classifier, RandomForest Classifier, Gradient Boost Classifier, XGBoost Classifier
# Building 6 models and evaluating using KFold and cross_val_score (will leverage approach from Mentor Learning Session)
# Create list of models
models = [
('Logistic Regression', LogisticRegression(random_state=1)),
('Decision Tree', DecisionTreeClassifier(random_state=1)),
('Bagging Classifier', BaggingClassifier(random_state=1)),
('Random Forest', RandomForestClassifier(random_state=1)),
('Gradient Boost', GradientBoostingClassifier(random_state=1)),
('XGBoost', XGBClassifier(random_state=1, eval_metric='logloss'))
]
# Create empty list for results and model names
results =[]
names = []
print('Cross-validation performance:\n')
for name, model in models:
    # Setting our k-fold parameter to 10 folds
    kfold = StratifiedKFold(
        n_splits=10,
        shuffle=True,
        random_state=1,
    )
    # Calculate cross validation score
    cv_result = cross_val_score(
        estimator=model,
        X=X_train,
        y=y_train,
        scoring='recall',
        cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print(f'{name}: {round(cv_result.mean() * 100, 2)}')
print('--' * 50)
print('Training Performance: ')
# Loop through models, fit to training data and calculate training recall score
for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_train)
    scores = recall_score(y_train, pred) * 100
    print(f'{name}: {scores}')
# Printing validation performance for each of the models
# Creating a DataFrame to capture all the models validation scores
val_score = {}
for name, model in models:
    val_score[name] = model_performance_classification_sklearn(model, X_val, y_val)
    print(f'Performance for {name}:')
    print('Validation performance: \n', val_score[name])
    print('--' * 50)
Building the same 6 model types with oversampled data using Synthetic Minority Over Sampling Technique (SMOTE)
# Resample dataset and oversample using SMOTE
sm = SMOTE(
sampling_strategy=1,
k_neighbors=5,
random_state=1,
)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Look at data before and after over sampling
print(f"Before Oversampling, counts of label 'Yes': {sum(y_train == 1)}")
print(f"Before Oversampling, counts of label 'No': {sum(y_train == 0)} \n")
print(f"After Oversampling, counts of label 'Yes': {sum(y_train_over == 1)}")
print(f"After Oversampling, counts of label 'No': {sum(y_train_over == 0)} \n")
# Look at the size of the oversampled dataset. Previous split had 6,075 rows
print(f"After Oversampling, the shape of train_X: {X_train_over.shape}")
print(f"After Oversampling, the shape of train_y: {y_train_over.shape} \n")
Size of training set nearly doubled
# Building 6 models with the oversampled data and evaluating performance
# Create list of models
models = [
('Logistic Regression_Oversampled', LogisticRegression(random_state=1)),
('Decision Tree_Oversampled', DecisionTreeClassifier(random_state=1)),
('Bagging Classifier_Oversampled', BaggingClassifier(random_state=1)),
('Random Forest_Oversampled', RandomForestClassifier(random_state=1)),
('Gradient Boost_Oversampled', GradientBoostingClassifier(random_state=1)),
('XGBoost_Oversampled', XGBClassifier(random_state=1, eval_metric='logloss'))
]
print('Training Performance: ')
# Loop through models, fit to training data and calculate training recall score
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = model_performance_classification_sklearn(model, X_train_over, y_train_over)
    print(f'{name}: \n{scores}')
    print('--' * 50)
print('Validation Performance: ')
# Loop through models and see how they perform on the validation set. Scores will be saved in our dict
for name, model in models:
    val_score[name] = model_performance_classification_sklearn(model, X_val, y_val)
    print(f'{name}:')
    print(val_score[name])
Build 6 models using undersampled data
# Undersampling our dataset with RandomUnderSampler
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# Look at data before and after under sampling
print(f'Before undersampling, counts of label "Yes": {sum(y_train==1)}')
print(f'Before undersampling, counts of label "No": {sum(y_train==0)}\n')
print(f'After undersampling, counts of label "Yes": {sum(y_train_un==1)}')
print(f'After undersampling, counts of label "No": {sum(y_train_un==0)}\n')
# Look at the size of the undersampled dataset. Previous split had 6,075 rows
print(f"After undersampling, the shape of train_X: {X_train_un.shape}")
print(f"After undersampling, the shape of train_y: {y_train_un.shape} \n")
# Build 6 models with the undersampled data and evaluate performance
models = [
('Logistic Regression_Undersampled', LogisticRegression(random_state=1)),
('Decision Tree_Undersampled', DecisionTreeClassifier(random_state=1)),
('Bagging Classifier_Undersampled', BaggingClassifier(random_state=1)),
('Random Forest_Undersampled', RandomForestClassifier(random_state=1)),
('Gradient Boost_Undersampled', GradientBoostingClassifier(random_state=1)),
('XGBoost_Undersampled', XGBClassifier(random_state=1, eval_metric='logloss'))
]
# Loop through the models, fit to undersampled data and calculate performance metrics
print('Training Performance:\n')
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = model_performance_classification_sklearn(model, X_train_un, y_train_un)
    print(f'{name}: \n{scores}')
    print('--' * 50)
# Loop through the models and see how they perform on the validation set. Scores will be saved in our dict
print('Validation Performance:\n')
for name, model in models:
    val_score[name] = model_performance_classification_sklearn(model, X_val, y_val)
    print(f'{name}:')
    print(val_score[name])
For tuning, I will select the best performer from the three different categories we built (Baseline, Oversampled, Undersampled). The three models I selected are:
%%time
# Define baseline model from above
model = GradientBoostingClassifier(random_state=1)
# Define grid or parameters to tune
parameters = {
'n_estimators': [100,150,200,250],
'subsample': [0.8, 0.9,1],
'max_features':[0.7,0.9,1],
'min_samples_split': np.arange(5,20,5)
}
# Type of score to use
scorer = metrics.make_scorer(metrics.recall_score)
# Run Random Search
rand_obj = RandomizedSearchCV(estimator=model, param_distributions=parameters, scoring=scorer, cv=5, n_jobs=-1, n_iter=50, random_state=1)
# Fit parameters in rand_obj
rand_obj.fit(X_train, y_train)
# Look and see what the best parameters of the randomized search is
print('Best parameters for Gradient Boost Classifier based off the randomized search are:')
print(pd.Series(rand_obj.best_params_))
# Build model with best parameters
gbc_tuned = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=250,
min_samples_split=10,
max_features= 0.9
)
# Fit model to training data
gbc_tuned.fit(X_train, y_train)
# Calculating our models performance on the training set
# Create an empty dict to track training score
train_score = {}
print('GBC Training Performance:')
train_score['Gradient Boost_Tuned'] = model_performance_classification_sklearn(gbc_tuned, X_train, y_train)
train_score['Gradient Boost_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(gbc_tuned, X_train, y_train)
# Calculate model performance on validation set
val_score['Gradient Boost_Tuned'] = model_performance_classification_sklearn(gbc_tuned, X_val, y_val)
print('Validation Performance: ')
val_score['Gradient Boost_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(gbc_tuned, X_val, y_val)
# Define our model
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')
# Define grid of parameters to use for Randomized Search
parameters = {
"n_estimators": [10,30,50,150,250],
"scale_pos_weight":[1,2,5],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
# Set our scoring metric
scorer = metrics.make_scorer(metrics.recall_score)
# Call RandomizedSearchCV
rand_obj = RandomizedSearchCV(estimator=xgb_tuned, param_distributions=parameters, n_jobs=-1, n_iter=50, scoring=scorer, cv=5, random_state=1)
# Fit parameters in rand_obj
rand_obj.fit(X_train_over, y_train_over)
# Look and see what the best parameters of the randomized search is
print('Best parameters for XGBoost based off the randomized search are:')
print(pd.Series(rand_obj.best_params_))
# Build model with best parameters
xgb_tuned = XGBClassifier(
    random_state=1,
    subsample=0.9,
    scale_pos_weight=5,
    n_estimators=50,
    learning_rate=0.05,
    colsample_bytree=0.9,
    colsample_bylevel=1
)
# Fit model on oversampled training data
xgb_tuned.fit(X_train_over, y_train_over)
# Calculating our model's performance on the training set
print('XGBoost Training Performance:')
train_score['XGBoost_Tuned'] = model_performance_classification_sklearn(xgb_tuned, X_train_over, y_train_over)
train_score['XGBoost_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(xgb_tuned, X_train_over, y_train_over)
# Calculate model performance on validation set
val_score['XGBoost_Tuned'] = model_performance_classification_sklearn(xgb_tuned, X_val, y_val)
print('Validation Performance:')
val_score['XGBoost_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(xgb_tuned, X_val, y_val)
# define our model
lr_tuned = LogisticRegression(random_state=1)
# define our grid of parameters
parameters = {
'C': np.arange(0.1,1.1, 0.1),
'penalty': ['l1', 'l2', 'elasticnet', 'none'],
'solver': ['lbfgs','newton-cg','liblinear', 'saga', 'sag'],
'max_iter': [100, 200, 250, 500, 1000]
}
# Call on randomized cv
rand_obj = RandomizedSearchCV(estimator = lr_tuned, param_distributions=parameters, n_jobs=-1, n_iter=50, scoring=scorer, cv=5, random_state=1)
# Fit model on training data
rand_obj.fit(X_train_un, y_train_un)
# Look and see what the best parameters of the randomized search is
print('Best parameters for Logistic Regression based off the randomized search are:')
print(pd.Series(rand_obj.best_params_))
# Build model with best parameters
lr_tuned = LogisticRegression(
random_state=1,
solver='newton-cg',
penalty='none',
C=0.5,
max_iter=1000
)
# Fit model on undersampled training data
lr_tuned.fit(X_train_un ,y_train_un)
# Check training performance
train_score['Logistic Regression_Tuned'] = model_performance_classification_sklearn(lr_tuned, X_train_un, y_train_un)
train_score['Logistic Regression_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(lr_tuned, X_train_un, y_train_un)
# Calculate the validation performance
val_score['Logistic Regression_Tuned'] = model_performance_classification_sklearn(lr_tuned, X_val, y_val)
print('Validation Performance: ')
val_score['Logistic Regression_Tuned']
# Create confusion matrix
confusion_matrix_sklearn(lr_tuned, X_val, y_val)
We see improved performance on the validation set too, and good generalization
Compare the model performance of tuned models and choose the best model
# Print model scores for comparison
cols = ['Gradient Boost_Tuned', 'XGBoost_Tuned', 'Logistic Regression_Tuned']
for col in cols:
    print(f'{col} Training and Validation Performance: ')
    print(train_score[col])
    print(val_score[col])
    print('--' * 50)
Based on the performances above, I would select the Gradient Boosting Classifier. The model was able to generalize and score well on recall. This model gives me the highest confidence in predicting which customers are likely to leave
# Check selected model performance on test set
model_performance_classification_sklearn(gbc_tuned, X_test, y_test)
# Identify and plot feature importance
feature_names = X.columns
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Top features by importance:
Let's utilize SHAP to explore contributions of features and compare to sklearn method
# Initialize package
shap.initjs()
# Calculate SHAP values
explainer = shap.TreeExplainer(gbc_tuned)
shap_values = explainer.shap_values(X)
# Plot shap values
shap.summary_plot(shap_values, X)
Impacts of top features:
Create final model using pipelines. I will create two different pipelines: one for numerical and one for categorical columns.
For numerical features, I will do missing value imputation and log transformation as pre-processing. I'll note here that I ran my previous models with and without log transformations and received similar results, so I could leave that step out.
For categorical, I will do one hot encoding and missing value imputation
# Create list of numerical variables
numerical_features = [
'Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_Revolving_Bal',
'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt',
'Total_Trans_Ct',
'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio'
]
# Create list of categorical variables
categorical_features = [
'Gender',
'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'
]
# Create class for log transformation to include in pipeline using sklearn's base package
class LogTransform(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transform: nothing to learn from the data
        return self

    def transform(self, X, y=None):
        return np.log(X + 1)
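As a quick sanity check on the transform itself: `np.log(X + 1)` matches NumPy's dedicated `np.log1p`, which is the numerically safer spelling for values near zero. A small illustration:

```python
import numpy as np

x = np.array([0.0, np.e - 1, 9.0])
print(np.log(x + 1))  # same values as np.log1p(x): [0., 1., log(10)]
```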
# Create transformer for numerical variables to apply log transform and simple imputer
numeric_transformer = Pipeline(
steps=[('transform', LogTransform()),
('imputer', SimpleImputer(strategy='median'))]
)
# Create transformer for categorical variables to apply one hot encoder and simple imputer
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# As in our MLS, I included handle_unknown='ignore' so the encoder can handle any category in the test data that was not seen in training
# Combine categorical and numerical transformers using column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# Split Data into target and independent variables
X = df1.drop('Attrition_Flag', axis=1)
y = df1['Attrition_Flag']
# Split data into train and test
# Note: Decided not to create a validation set since I do not need to compare models
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
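`stratify=y` keeps the attrition rate identical in both splits, which matters with an imbalanced target. A minimal demonstration on a toy 80/20 label vector (not the bank's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # toy imbalanced target: 20% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both splits keep ~20% positives
```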
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"GBC",
GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=250,
min_samples_split=10,
max_features=0.9
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
# Use model to predict on test set
y_test_pred = model.predict(X_test)
# calculate model performance on train set
model_performance_classification_sklearn(model,X_train, y_train)
# Calculate model performance on test set
model_performance_classification_sklearn(model, X_test, y_test)
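Because the pipeline bundles preprocessing and the classifier, the whole object can be persisted once and reloaded later to score new customers. A sketch with joblib on toy data (a stand-in LogisticRegression here, not the tuned GBC pipeline):

```python
import numpy as np
from joblib import dump, load
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # toy features
y = np.array([0, 0, 1, 1])                  # toy target

pipe = Pipeline([('imp', SimpleImputer()), ('clf', LogisticRegression())])
pipe.fit(X, y)
dump(pipe, 'model.joblib')        # persist preprocessing + model together
restored = load('model.joblib')
print(restored.predict([[2.5]]))  # same prediction as the in-memory pipeline
```

Persisting the pipeline rather than the bare estimator guarantees new data always passes through the exact same preprocessing.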
Business recommendations and insights
Key Observations:
Recommendations for improvements:
Section to contain functions
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
    """Plot distributions and boxplots of a predictor split by the target classes."""
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )
    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
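To make the four metrics concrete, here is a hand-checkable toy example (2 TP, 1 FN, 1 FP, 1 TN), independent of any fitted model:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]  # 2 TP, 1 FN, 1 FP, 1 TN

print(accuracy_score(y_true, y_pred))   # 3/5 = 0.6
print(recall_score(y_true, y_pred))     # 2/3: share of actual attriters caught
print(precision_score(y_true, y_pred))  # 2/3: share of flagged customers who attrite
print(f1_score(y_true, y_pred))         # 2/3: harmonic mean of precision and recall
```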
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")